[AMD] Add Warp-Pipeline Support, Gluon and LLVM lowering. #8586
Conversation
partitioning of the code into stages.
can correctly insert fence.
Update interfaces per recent changes; make it actually work
Fix wrong offset insertion; refactor loop; code cleanup
barrier should be inserted from the warp causing the dependency.
Added builtin split_warp_pipeline(); inserting the builtin splits the code region into two pipeline clusters.
Address conflict
now runs on mi350
- Polish conversion code
- Found an important fix needed; just commented for now
third_party/amd/backend/compiler.py (Outdated)

    custom_lds_size = 0
    amd.passes.ttgpuir.add_optimize_lds_usage(pm, options.arch, custom_lds_size)
    amd.passes.ttgpuir.add_warp_pipeline_conversion(pm)
    passes.common.add_canonicalizer(pm)
do we really need another full canonicalization pass here? Might be better to do targeted cleanups
That's a good point.
done.
    forOp->setAttr("triton.warp_pipeline.total_stages",
                   b.getI32IntegerAttr(totalStages));
    forOp->setAttr("triton.warp_pipeline.lead_stages",
                   b.getI32IntegerAttr(1)); // TODO: make configurable
what do those attributes control?
These don't do anything right now. I'll change them to a unit attribute that identifies the pipelined scf.for, and will reconsider them once I have a more concrete idea of how to use them.
done, both removed and replaced with .pipelined_for.
      cluster.push_back(op);
    }
    if (!cluster.empty())
      clusters.push_back(std::move(cluster));
why don't we create the regions directly rather than having a pass post-process them?
This basically comes from considering how to program warp-pipelining in Gluon. At first I considered using a Python function to define a region, as in warp specialization, but there were some issues: scf.execute_region doesn't have block arguments, and the Gluon user doesn't fully know which values need to be yielded. It might not be impossible to rewrite a Python function into an scf.execute_region, but the required analysis could be even more complicated than just defining clusters by pipeline borders. The border-based method also prevents the user from mistakenly placing operations outside the clusters when pipelining.
This is also helpful when we migrate the existing block-pingpong scheduling, since this pass can be used from non-Gluon passes as well. The new auto-partitioning will create regions directly and might be able to replace the others, but that's not certain yet.
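To make the border semantics concrete, here is a minimal, language-neutral Python sketch of what the clustering amounts to. The names are illustrative only; the actual implementation is the C++ pass under review above.

    def partition_by_borders(ops, is_border):
        # Split a flat list of loop-body ops into pipeline clusters,
        # cutting at (and dropping) each split_warp_pipeline border op.
        clusters, cluster = [], []
        for op in ops:
            if is_border(op):
                if cluster:
                    clusters.append(cluster)
                cluster = []
            else:
                cluster.append(op)
        if cluster:
            clusters.append(cluster)
        return clusters

    # partition_by_borders(["load_a", "load_b", "BORDER", "mma", "store"],
    #                      lambda op: op == "BORDER")
    # -> [["load_a", "load_b"], ["mma", "store"]]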
    void runOnOperation() override {
      ModuleOp m = getOperation();
      OpBuilder builder(m);
      ModuleAllocation moduleAllocation(m);

      for (auto funcOp : m.getOps<mlir::triton::FuncOp>()) {
        Allocation *allocation = moduleAllocation.getFuncData(funcOp);
        funcOp.walk([&](scf::ForOp forOp) {
          if (auto totalStages =
                  forOp->getAttr("triton.warp_pipeline.total_stages")) {
            Location loc = forOp.getLoc();
            emitPipelinedFor(builder, loc, forOp, allocation);
          }
        });
      }
    }
why can't the region be lowered by a normal pattern rewrite?
That could be a better idea.
done.
- Simplify discardable attr for marking pipeline
- Change to use pattern match to convert ops
region is now inlined in the pass and no longer needed.
missed from merge.
merge from main
Enable Gluon kernels to express and compile warp-pipelined loops, where different warps execute staggered stages (e.g., load, compute, store), to improve compute/memory overlap and utilization.
This is achieved through a structured, two-phase lowering pipeline:
- Phase 1: a Gluon-level conversion pass that partitions the loop body into pipeline clusters at the split_warp_pipeline() borders and marks the resulting scf.for as pipelined.
- Phase 2: an LLVM lowering phase that emits the staggered warp schedule, inserting barriers from the warp causing each dependency.
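A minimal sketch of what such a kernel might look like. Only the split_warp_pipeline() builtin and its border semantics come from this PR; the ttgl.amd namespace placement, the kernel signature, and the elided stage bodies are assumptions for illustration.

    from triton.experimental import gluon
    from triton.experimental.gluon import language as ttgl

    @gluon.jit
    def warp_pipelined_kernel(a_ptr, out_ptr, n, BLOCK: ttgl.constexpr):
        for i in range(0, n, BLOCK):
            # Cluster 0: memory stage -- global loads would go here.
            # (loads elided in this sketch)
            # Border: every op above this call falls into the first pipeline
            # cluster, every op below into the second; warps execute the
            # clusters staggered against each other.
            ttgl.amd.split_warp_pipeline()  # assumed namespace; builtin name from this PR
            # Cluster 1: compute + store stage -- mma and stores would go here.
            # (compute/stores elided in this sketch)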
Future work
- Migrate the existing block-pingpong scheduling
- An entirely new auto-partitioning pass